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In this work we perform a meta-analysis of neuroimaging data, 
consisting of locations of peak activations identified in 162 separate 
studies on emotion. Neuroimaging meta-analyses are typically per- 
formed using kernel-based methods. However, these methods require 
the width of the kernel to be set a priori and to be constant across 
the brain. To address these issues, we propose a fully Bayesian non- 
parametric binary regression method to perform neuroimaging meta- 
analyses. In our method, each location (or voxel) has a probability 
of being a peak activation, and the corresponding probability func- 
tion is based on a spatially adaptive Gaussian Markov random field 
(GMRF). We also include parameters in the model to robustify the 
procedure against miscoding of the voxel response. Posterior inference 
is implemented using efficient MCMC algorithms extended from those 
introduced in Holmes and Held [Bayesian Anal. 1 (2006) 145-168]. 
Our method allows the probability function to be locally adaptive 
with respect to the covariates, that is, to be smooth in one region 
of the covariate space and wiggly or even discontinuous in another. 
Posterior miscoding probabilities for each of the identified voxels can 
also be obtained, identifying voxels that may have been falsely clas- 
sified as being activated. Simulation studies and application to the 
emotion neuroimaging data indicate that our method is superior to 
standard kernel-based methods. 

1. Introduction. 

1.1. Meta-analysis of neuroimaging studies. In recent years there has 
been a rapid increase in the number and variety of neuroimaging studies 
being performed around the world. This growing body of knowledge is ac- 
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companied by a need to integrate research findings and establish consistency 
across labs and scanning procedures, and to identify consistently activated 
regions across a set of studies. Performing meta-analyses has become the 
primary research tool for accomplishing this goal [Wager, Lindquist and 
Kaplan (2007); Wager et al. (2009)]. Evaluating consistency is important 
because false positive rates in neuroimaging studies are likely to be higher 
than in many fields, as many studies do not adequately correct for multiple 
comparisons. Thus, some of the reported activated locations are likely to 
be false positives, and it is important to assess which findings have been 
replicated and have a higher probability of being real activations. Individual 
imaging studies often use very different analyses [see Lindquist (2008) for an 
overview] , and effect sizes are only reported for a small number of activated 
locations, making combined effect-size maps across the brain impossible to 
reconstruct from published reports. Instead, meta-analysis is typically per- 
formed on the spatial coordinates of peaks of activation (peak coordinates), 
reported in the standard coordinate systems of the Montreal Neurologic 
Institute (MNI) or Talairach and Tournoux (1988), and combined across 
studies. These peak coordinates typically correspond to the voxel whose 
t-statistic takes the maximum value in a spatially coherent cluster of acti- 
vation, that is, the max statistic among a set of adjacent voxels that ex- 
ceed a certain threshold. This information is typically provided in most 
neuroimaging papers and simple transformations between the two standard 
spaces exist. 

A typical neuroimaging meta-analysis studies the locations of peak ac- 
tivations from a large number of studies and seeks to identify regions of 
consistent activation. This is usually performed using kernel-based methods 
such as activation likelihood estimation [ALE; Turkeltaub et al. (2002)] or 
kernel density approximation [KDA; Wager, Jonides and Reading (2004)]. In 
both methods, maps are created for each study by convolving an indicator 
map, consisting of an impulse response at each study peak, with a kernel 
of predetermined shape and width. The resulting maps are thereafter com- 
bined across studies to create a meta-analysis map. Monte Carlo methods 
are used to find an appropriate threshold to test the null hypothesis that 
the n reported peak coordinates are uniformly distributed throughout the 
grey matter. A permutation distribution is computed by repeatedly gener- 
ating n peaks at random locations and performing the smoothing operation 
to obtain a series of statistical maps under the null hypothesis that can be 
used to compute voxel-wise p- values. The two approaches differ in the shape 
of the smoothing kernel. In KDA, it is assumed to be a sphere with fixed 
radius, while in ALE it is a Gaussian with fixed standard deviation. 

A major shortcoming of kernel-based approaches is that the width of the 
kernel, and thus the amount of smoothing, is fixed a priori and assumed 
to be constant throughout the brain. In order to address these concerns, 
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Fig. 1. Example of the raw data are shown for a representative sagittal, coronal and 
axial slice of the brain. Each point represents a reported activation foci in an individual 
study by criteria designated by that particular study. All foci are reported and plotted in 
the MNI brain template to allow for cross study comparisons. 

we propose a fully Bayesian nonparametric binary regression method for 
performing neuroimaging meta-analysis. In our method, each location has 
a probability of being a peak activation, and the corresponding probability 
function is based on a spatially adaptive Gaussian Markov random field 
(GMRF). The locally adaptive features of our method allows us to better 
match the natural spatial resolution of the data across the brain compared 
to using an arbitrary chosen fixed kernel size. 

In this work, a meta-analysis was performed on the results of 162 neu- 
roimaging studies (57 PET and 105 fMRI) on emotion. The studies were 
all performed on healthy adults and published between 1990 and 2005. For 
each study, the foci of activation were included when reported as signifi- 
cant by the criteria designated in the individual studies. Relative decreases 
in activation in emotion related tasks were not analyzed. All coordinates 
were reported on the MNI coordinate system to allow for cross study com- 
parisons. Together, these studies yield a data set consisting of 2478 unique 
peak coordinates. This data set is described in greater detail in Kober et al. 
(2008). Due to the relative scarcity of neuroimaging studies on a particular 
topic (e.g., emotion), it is standard practice in meta-analysis to combine 
data obtained using different imaging modalities, sample sizes and statisti- 
cal analyses. This is done to ensure that the analysis has enough power to 
detect effects of interest. In addition, studies in Wager et al. (2008) have 
shown no significant difference between MRI and PET in the assessment of 
their functional maps and their foci of activation. Figure 1 shows the raw 
data for representative slices of the brain with fixed x, y and z directions, 
respectively. Each point in the plot represents the location of the peak of 
a cluster of reported activation from one of the 162 neuroimaging studies. 
The primary goal for analyzing this data set was to determine areas of the 
brain that are consistently active in studies of emotion. 
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1.2. Statistical modeling for binary response. Let Y be a random binary 
response variable, X a vector of covariates and p(x) the response probabil- 
ity function, f?(x) = Pr(y = 1\X = x). In the context of fMRI meta-analysis, 
Y = 1 if the voxel is reported as being a peak activation. The vector X in- 
cludes the voxel location and possibly other covariates related to the patient 
or the study. In nonparametric binary regression, we have p(x) = H(z(pc)), 
where H is a specified cumulative distribution function often referred to 
as the link function. Popular link functions are the standard logistic and 
standard normal cumulative density functions. 

The traditional parametric approach to binary regression involves set- 
ting z(x) = a + /3 T x, with unknown parameters a and j3. McCullagh and 
Nelder (1989) contains a comprehensive treatment of frequentist parametric 
methods with exponential family models, binary regression being a special 
case. Bayesian binary regression is well documented in, for example, Dey, 
Ghosh and Mallick (2000). In particular, Albert and Chib (1993) and Holmes 
and Held (2006) introduced auxiliary variable methods that provide efficient 
Markov chain Monte Carlo (MCMC) inference for parametric binary regres- 
sion. 

There is an extensive non-Bayesian literature on nonparametric regression 
using exponential family models, with binary regression treated as a special 
case. O'Sullivan, Yandell and Raynor (1986) estimated a single function us- 
ing a penalized likelihood approach, and their work was extended to additive 
models by Hastie and Tibshirani (1990). Gu (1990) and Wahba et al. (1995) 
used tensor product smoothing splines to allow for interactions between vari- 
ables and estimated smoothing parameters via a generalized cross-validation 
technique. Loader (1999) proposed a local likelihood approach for both uni- 
variate and bivariate nonparametric estimation and provided data-driven 
bandwidth estimators. 

Bayesian methods for nonparametric binary regression were developed in 
Wood and Kohn (1998), Holmes and Mallick (2003), Choudhuri, Ghosal and 
Roy (2007), and Trippa and Muliere (2009). These methods are not locally 
adaptive, however. Krivobokova, Crainiceanu and Kauermann (2008) pro- 
posed an adaptive penalized spline estimator for binary regression based on 
quasi- likelihoods. Wood et al. (2008) presented a locally adaptive Bayesian 
estimator for binary regression by using a mixture of probit regressions where 
the argument of each probit regression is a thin-plate spline prior with its 
own smoothing parameters and the mixture weights depend on the covari- 
ates. 

In fMRI meta-analysis, kernel-based smoothing techniques are typically 
used to identify regions of consistent activation and Monte-Carlo proce- 
dures are used to establish statistical significance. These techniques count 
the number of activation peaks within a radius of each local brain area and 
compare the observed number to a null distribution to establish significance. 
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The kernel radius is chosen by the analyst, and kernels that match the natu- 
ral spatial resolution of the data are the most statistically powerful [Wager, 
Lindquist and Kaplan (2007)]. In our method, the function z(-) is assumed 
to be a spatially adaptive Gaussian Markov random field (GMRF) with lo- 
cally varying variance. The local adaptiveness of the procedure allows the 
probability function to be smooth in some regions and wiggly in others, de- 
pending on the data information. The need of adaptive smoothing for fMRI 
data has been demonstrated in Brezger, Fahrmeir and Hennerfeind (2007) 
and Yue, Loh and Lindquist (2010). The proposed Bayesian nonparamet- 
ric binary regression method is an extension to the binary response case of 
methods developed in Yue and Speckman (2010) and Yue and Loh (2011). 
To make this procedure better suited for application to fMRI meta-analysis, 
we incorporate additional model parameters associated with the probabil- 
ities of voxels being miscoded. This makes the modeling more robust to 
possible errors in the data. The posterior inference is carried by efficient 
MCMC algorithms extended from those in Holmes and Held (2006). From 
the model fit we obtain a map of the probability of observing a peak activa- 
tion across the brain as well as posterior miscoding probabilities. Regions of 
the brain with high probability estimates are identified as activated based on 
the meta-analysis. This makes the proposed method far more interpretable 
than earlier approaches. 

The rest of the paper is organized as follows. The proposed method is 
described in Section 2. Section 3 presents simulation studies comparing our 
method to other available methods. Results of the data analysis are given 
in Section 4. Section 5 concludes this work with discussions. 

2. Bayesian hierarchical modeling and inference. We describe in this sec- 
tion our nonparametric binary regression model using the spatially adaptive 
GMRF. Note that our method currently can only be implemented in two 
dimensions. We apply it to the fMRI setting by fitting the model to brain 
slices in succession. This is similar to the staggered approach in Penny, 
Trujillo-Barreto and Friston (2005), who used a two-dimensional Laplacian 
prior that is related to our GMRF prior. 

2.1. Spatially adaptive GMRF on regular lattice. Let us denote by x = 
! 2-21] . . . , x ni ^ n2 ) an Tridimensional vector of voxel locations on a regular 
ni x n 2 lattice (n = 111112) ■ Adopting notation Zjk = z(xjk), we assume that 
the underlying spatial process Zjk is an adaptive Gaussian Markov random 
field (GMRF) as introduced in Yue and Speckman (2010). This adaptive 
GMRF is based on the following spatial Gaussian random walk model: 



(1) 
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where ^ and V^ ^ denote the second-order backward difference opera- 
tors in the vertical and horizontal directions respectively, that is, Q ^Zjk = 
z j+ i,k - 2z jk + Zj-i t k and V^^Zjk = z jtk +i - 2z jk + z j)k -i for 2 < j < ni - 1 
and 2 < j < 122 — I. The parameter 5 2 is a global smoothing parameter ac- 
counting for large-scale spatial variation while 7? fc are the adaptive smooth- 
ing parameters that capture the local structure of the process z(x). The 
equation (1) essentially defines an adaptive smoothness prior on the second- 
order difference (V^ ^ + V^ ^)zjk- As a result, the conditional distribution 
of each Zjk given the rest z_j& is Gaussian and only depends on its neighbors 
in a specific way. This dependence can be shown using a graphical notation 
by expressing the conditional expectation of an interior zj^ as 

(ooooo ooooo o o • o o \ 

o o • o o O • O • O OOOOO) 

8o • o • o — 2ooooo— lioooi , 

o o • o o o • o • o oooool 

ooooo ooooo o o • o o / 

where the locations denoted by a "•" represent those values of z-jk that the 
conditional expectation of Zjk depends on, and the number in front of each 
grid denotes the weight given to the corresponding "•" locations. Therefore, 
the conditional mean of Zj^ is a particular linear combination of the values 
of its neighbors, and its conditional variance is V&r(zjk\z-jk) = 205 2 7? fc . 

The use of 7? fc is important for estimating activation probabilities in 
a fMRI meta-analysis. To identify consistently activated regions across a set 
of studies, we need less smoothing (large 7? fc ) where there are many re- 
ported peak locations and relatively more smoothing (small 7? fe ) where very 
few or no peaks are reported. Standard smoothing techniques (e.g., kernel 
smoother with fixed width) suffer from a trade-off between increased de- 
tectability and loss of information about the spatial extent and shape of the 
activation areas. Adaptive smoothing provided by 7? fc can reduce such loss 
of information. The need of adaptive smoothing for processing fMRI imaging 
data was also demonstrated in Brezger, Fahrmeir and Hennerfeind (2007) 
and Yue, Loh and Lindquist (2010). Note that setting 7? fc = 1 makes (1) 
a nonadaptive GMRF on lattice, yielding a Bayesian solution for thin-plate 
splines [see Rue and Held (2005), section 3.4.2]. 

Additional priors need to be specified for 7? fc in (1). We use independent 

inverse gamma priors for 7? fc , that is, ^J^ 1 ~ ' Gamma(i//2, 1/2), ^ > 0. The 
marginal prior distribution of the increment in (1) turns out to be a Student-t 
distribution with u degrees of freedom. We choose a Cauchy distribution 
{y = 1), which has been suggested as a default prior for robust nonpara- 
metric regression [Carter and Kohn (1996)] and sparse Bayesian learning 
[Tipping (2001)]. Yue and Loh (2011) and Brezger, Fahrmeir and Henner- 
feind (2007) also suggested similar priors for 7? fc in their work on adaptive 



BAYESIAN META-ANALYSIS OF FMRI DATA 



7 



spatial smoothing. Yue and Speckman (2010) and Yue, Loh and Lindquist 
(2010), however, assumed another spatial GMRF model for log(7| fc ) in a sec- 
ond hierarchy. Although it has been applied successfully for modeling spatial 
data, this two-stage GMRF prior forces the 7? fe to be smooth and it is not 
suitable for estimating spatial processes with jumps or sharp peaks. Further- 
more, the computation is rather complicated, precluding extensions to more 
flexible regression models, for example, the binary hierarchical regression 
model considered here. 

The prior for 5 2 is often chosen to be a conjugate diffuse but proper inverse 
gamma prior. We, however, propose to use a half-i distribution as the prior 
for its square root, that is, 

(3) [%S]cx(l + -(-) J , 5>0, 

where p is the parameter of degrees of freedom and S is the scale parameter. 
The half-t distribution can be treated as the absolute value of a Student-i 
distribution centered at zero [see Psarakis and Panaretos (1990)]. Although 
it is not commonly used in statistics, the half-t distribution was used in 
objective Bayesian inference by Wiper, Giron and Pewsey (2008) and sug- 
gested for use as a default prior for a variance component in hierarchical 
models [e.g., Gelman (2006); Gelman et al. (2008)]. This family includes, 
as special cases, the improper uniform density (if p = —1) and the proper 
half-Cauchy (if p= 1). Following Carvalho, Poison and Scott (2010), we use 
a standard half-Cauchy prior {p = S = 1) due to its heavy tail and substan- 
tial mass around zero. Although it is not conjugate, the half-t prior on 5 

can be written as S = |£|6>, where £ ~ N(0, 1) and 6 2 ~ IG(p/2, pS 2 /2) [e.g., 
Psarakis and Panaretos (1990)]. This property enables us to develop efficient 
MCMC sampling schemes as shown in the Appendix. 

2.2. Posterior inference. Although any cumulative distribution function 
(cdf) H that preserves the smoothness of z may be used as a link function, 
here, we only consider the case in which the H can be represented as the 
scale mixture of mean zero normal cdf 's. Two special examples are the well- 
known probit and logit link functions. With a specific link function, the 
posterior distribution of z is not analytically tractable, and thus an MCMC 
algorithm will be used to compute the posterior distribution. The algorithm 
is based on the auxiliary variable method in Holmes and Held (2006) and 
GMRF simulation techniques in Rue and Held (2005). Briefly, the data are 
augmented by introducing an auxiliary variable W{ that follows a normal 
distribution with mean Zj and variance A«. The new data Wi are associated 
with original binary data yi in the following way: yi = 1 if W{ > and y« = 
if Wi < 0. Then, the adaptive GMRF prior is taken on zi and a certain 
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prior distribution chosen for Aj depending on the link function. The full 
conditional distributions for the Gibbs sampler are all easily derived and 
can be efficiently sampled. In the Appendix we provide the detailed MCMC 
algorithms for the link functions that are the probit, logit and general scale 
mixture of normals. 

2.3. Robustification. In this section we describe how to robustify our 
procedure against miscoding of the response variable. Adopting the idea in 
Choudhuri, Ghosal and Roy (2007), we use indicator variables if) = (ipi, . . . , 
ip n )' such that ipi = 1 indicates that is miscoded and ipi = indicates 
that yi is correctly coded. In the context of fMRI meta-analysis, ipi = \ means 
that yi is either a false positive or a false negative. Since these variables 
cannot be observed, we treat them as unknown parameters that need to 
be estimated via taking priors on them. The joint posterior distribution of 
(ip, z) is then used to obtain a robust estimate of z, and also to identify the 
miscoded observations. 

We assume that each observation has equal probability of being miscoded, 
independent of other observations and z. Denote by r an a priori guess for 
the probability of an observation being miscoded. Given (ip,z), the y^s are 
independent Bernoulli random variables with probability of success (1 — 
ipi)H{zi) + ipi{\ — H{zi)). As a result, the conditional distributions of ipi are 
independent with 

(4) P(^ = l|y,z) = { 



r[l-H( Zi )] 



r[l-H(zi)} + (l-r)H{ziY 
rH{ Zi ) 



if Vi 



if y t = 0. 



(5) (wi\ip,c,v,y) 



rH{z i ) + (l-r)[l-H(z l )Y 

Consider the probit link without any hyperprior. As shown in Section A.l, 
we adjust latent variables Wi for miscoding, that is, yi = 1 if {wi > 0,ipi = 0} 
or {wi <0,ipi = l}. Then, 

(N^ Vi ,l)I{ Wi >0), if yi + ^ = l, 
\N(£7?i,l)/(™i<0), + 

Hence, samples from the joint distribution (ipi, Wi\z, y) can be drawn by 
first sampling ipi using (4) and then sampling Wi using (5). Since the full 
conditional of z does not depend on ip or y, the samples from the conditional 
distributions of the rest of the parameters can be drawn as described earlier. 
Note that the algorithm of this robust approach may be extended similarly 
to the logit link or an arbitrary symmetric link by introducing the relevant 
latent variables. 



3. Simulation studies. We performed two different types of simulation 
studies to investigate the performance of our method. The first simulation 
is in the setting of nonparametric binary regression, where the proposed 
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1.0 °-° Probit Logistic FAPS 



(c) (d) 

Fig. 2. Simulation I: (a) True probability function; (b) Estimated probability function 
using proposed method with probit link; (c) Estimated probability function using FAPS 
method; (d) Mean squared probability errors using proposed method and FAPS method. 



method is compared to an adaptive penalized spline model. The second 
simulation is in the setting of fMRI meta-analysis, where our method is 
compared to the kernel-based ALE method, which is commonly used in 
neuroimaging meta-analysis. 



3.1. Simulation I. The true probability function is assumed to be 



p(x) = <!>< 6exp 



-((x 1 -2) 2 + (x 2 -2) 



+ 3exp 



1 

10 



(xj + x 2 2 ) 



It is a smooth bimodal spatial surface on a 30 x 30 regular lattice as shown 
in Figure 2(a). One hundred data sets were simulated and we use the mean 
squared probability error (MSPE), 



1 n 

MSPE = -V{p(x l )-p(x l )} 2 , 
n ' 



i=l 



to measure performance, where p(-) is the estimated probability function. 
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The estimates obtained using our Bayesian nonparametric binary regres- 
sion model are compared to those obtained using the fast adaptive penalized 
splines (FAPS) model in Krivobokova, Crainiceanu and Kauermann (2008). 
The FAPS approach models the regression function as a penalized spline 
with a smoothly varying smoothing parameter function which is also mod- 
eled as a penalized spline. Their method handles local smoothing of binary 
data as a special case. The authors showed that the FAPS estimator outper- 
formed the penalized spline estimators in Crainiceanu et al. (2007) and Rup- 
pert and Carroll (2000). The model can be fit using the AdaptFit R package. 

Panels (b) and (c) in Figure 2 show typical fits for the bimodal func- 
tion using our method and FAPS method, respectively. It appears that the 
FAPS model has difficulty capturing the sharp peak and undersmoothes 
the flat portion as well. Figure 2(d) shows the distributions of the MSPE 
produced by those two methods, where the FAPS estimator is apparently 
outperformed. Also, in our method the two link functions yield similar per- 
formances in terms of MSPE. This is because nonparametric modeling of z 
makes the model robust against the choice of the link function. We believe 
that the underperformance of FAPS stems from using slowly varying func- 
tions to model local smoothing parameters. Although they provide compu- 
tational efficiency, such low-rank basis functions are unable to capture sharp 
changes in the function. Yue and Speckman (2010) presented similar results 
for normal response variables. Note that the robustification procedure is not 
required in this simulation study. 

3.2. Simulation II. In the second simulation study we began by con- 
structing a 64 x 64 probability map, denoted p(x,y), where the value at 
each voxel location (x,y) represents the probability that it be recorded as 
a "peak coordinate" in a neuroimaging study. The probability map consisted 
of two circular regions of heightened probability (see Figure 3A) , where the 
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Fig. 3. (A) The probability map used to generate random activation peaks; (B) One set 
of simulated activation peaks. 
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maximum probability is roughly 0.4. Voxels lying outside these two regions 
were set to have a constant background probability of 0.01, thus allowing 
for the possibility of "false positives" outside the two centers of activation. 
Next, the probability map was used to generate random activation peaks. 
The voxel at coordinate (x,y) was considered a reported peak according to 
a binomial distribution with probability of activation p(x,y). This process 
was repeated 100 times and each time gives rise to simulated meta-analysis 
data. Figure 3B shows the data for one repetition. The data shows clear clus- 
tering around the two regions of activation, while still allowing for spurious 
activations in the rest of the image. This corresponds with the behavior of 
standard meta-analysis data (see, e.g., Figure 1). 

Each of the 100 repetitions were analyzed using the kernel-based ALE 
method as well as our Bayesian nonparametric binary regression model. In 
the former, a kernel with bandwidth 10 mm full width at half maximum 
(FWHM) was used, as this is the standard in the field. A Monte Carlo 
procedure was used to determine the appropriate threshold to test the null 
hypothesis that the reported peak coordinates are uniformly distributed 
throughout the grey matter. A permutation distribution is computed by re- 
peatedly generating peaks at random locations and performing the smooth- 
ing operation to obtain a series of statistical maps under the null hypothesis 
that can be used to determine which voxels had p-values below a, where 
a was set to 0.05 and 0.01. Regarding our Bayesian method, the robustifi- 
cation procedure described in Section 2.3 is implemented since we use the 
background probability of 0.01 to produce the false positives. To see how 
sensitive the results are to the use of robustification, we fit the model with 
prior miscoding probability r = (no robustification), r = 0.01 and r = 0.05. 
Figures 4A and B show the proportion of times each voxel was deemed sig- 
nificant at the 5% level and the 1% level, respectively, in the 100 repetitions, 
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Fig. 4. Proportion of times each voxel was deemed significant at the 5% level (A) and 
the 1% level (B) using the ALE method. 
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when the ALE method was used. It is clear that the kernel smoother does 
a very good job of finding true positives, but tends to have a large number 
of false positives in the area immediately surrounding the activated regions. 
Figure 5 shows the average probability of activation in each voxel obtained 
using our method. The maps in the left column are not thresholded, while 
those in the right column are thresholded at 0.01. Apparently, our estimates 
are closer to the simulated probability map and produce much fewer false 
positives than the kernel estimates. Furthermore, our method yields fewer 
false positives as the value of r, the prior miscoding probability, increases, 
that is, the fit becomes more robust. The spatial extent of the activation 
region, however, is not significantly shrunk, making a strong case for the use 
of adaptive smoothing. 

3.3. Computational performance and MCMC diagnostics. Thanks to the 
sparse structure of the adaptive GMRF prior used, the proposed models 
provide fast MCMC computation for nonparametric binary regression. To 
complete 5000 iterations on a 3.06 GHz Intel iMac desktop with 4GB mem- 
ory, it took the probit model 9.23, 46.06 and 11.17 seconds at sample size 
n = 30 x 30, 60 x 60 and 90 x 90, respectively, for estimating the bimodal 
function in Simulation I. The logistic model is a little slower, taking 11.89, 
55.83 and 138.69 seconds to finish the same amount of computations. The 
computing times of both models increase with sample sizes at order n, 
roughly. The programs were written in the FORTRAN language, making 
use of the LAPACK and BLAS packages. 

It is well known that the GMRF z are strongly dependent on each other 
as well as on the auxiliary variable w [see, e.g., Rue and Held (2005); Holmes 
and Held (2006)]. Those posterior correlations are likely to cause slow mix- 
ing in the Markov chain. To combat this issue, we sampled z as a block and 
employed the joint updating tricks as used in Holmes and Held (2006) (see 
the Appendix for details). Since the computation is fast, we also suggest run- 
ning a relatively large number of MCMC iterations and applying a thinning 
factor of I by collecting samples after every I iterations. In Simulation I, for 
instance, we found that it is sufficient to run 15,000 MCMC iterations (5000 
burn-in and 10,000 sampling) with a thinning factor of 10 to obtain reliable 
estimates. Figure 6 shows typical trace plots and autocorrelation functions 
of the samples of different variables for Simulation I. As we can see, the 
mixing of the chain is satisfactory for both probit and logistic models. 

4. Data analysis. We describe here the results of our meta-analysis of 
the fMRI data. As mentioned before, the data consists of the coordinates 
of 2478 peaks representing the locations of voxel activations, collected from 
162 neuroimaging studies. The raw data consists of a three-dimensional im- 
age with dimensions 91 x 109 x 91 whose elements took the value 1 if an 
activation had been reported at that voxel and otherwise. Figure 1 shows 
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Fig. 5. Average probability of activation in each voxel obtained using the adaptive GMRF 
method combined with the robustification procedure under different prior miscoding prob- 
abilities: r — (top row), r = 0.01 (middle row) and r = 0.05 (bottom row); The maps in 
the left column are not thresholded, while those in the right column are thresholded at 0.01. 



14 



Y. R. YUE, M. A. LINDQUIST AND J. M. LOH 




200 400 600 800 1000 200 400 600 800 1000 200 400 600 800 1000 




Fig. 6. Assessment of MCMC convergence for Simulation I. The top (bottom) two rows 
contain the typical trace plots and autocorrelation functions of the samples of variables z, 
7 and w from fitting a probit (logistic) model. 

the raw data for a representative slice of the brain with fixed x, y and z 
directions, respectively. 

The binary nature of the meta-analysis data makes it an ideal candidate 
for our Bayesian nonparametric binary regression method. As our method is 
currently only implemented in two dimensions, we fit our method slice- wise 
across the brain for each orientation (i.e., for the fixed x, y and z direction). 
Prior to performing our method on a slice, we applied smoothing in the fixed 
direction by including all activations located within 10 mm of the slice of 
interest. 

In our simulation studies (Section 3), we found that the binary regression 
model is not sensitive to the choice of link function. We therefore fit a probit 
model to the data for computational efficiency. To make our estimation 
robust against false positives, we incorporated the robustification procedure 
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Fig. 7. Assessment of MCMC convergence for the data analysis. The top row contains 
the typical trace plots of the samples of variables z, 7 and w; the bottom row contains the 
corresponding autocorrelation functions. 

(Section 4) in the model with prior miscoding probability r = 0.01 for every 
voxel. Due to the high dimension of the data, the MCMC was run for 60,000 
iterations with 10,000 burn-in and a thinning factor of 50 iterations, resulting 
in posterior samples of size 1000. The Markov chains mix well as shown in 
Figure 7. 

Once the Bayesian binary regression model was fit, posterior probabil- 
ity maps were obtained indicating the probability of being a location of 
peak activation across the brain. Regions with probability values higher 
than 0.3 were color-coded and superimposed onto an anatomical reference 
image. The relatively low threshold is indicative of the dispersion of foci 
locations in the data. Figure 8 shows results for the three slices described 
above. Key regions of activation observed in the figure include the thala- 
mus (8A), amygdala (8B) and the ventral striatum (8C). These regions are 




Fig. 8. Thresholded posterior probability maps are shown for the sagittal, coronal and 
axial slice of the brain depicted in Figure 1. Regions with posterior probability of observing 
a peak activation higher than 0.30 are color-coded. 
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Fig. 9. Miscoding probabilities are shown for the sagittal, coronal and axial slice of the 
brain depicted in Figure 1. Points with posterior miscoding probability higher than 0.10 
are color-coded. 

known to be associated with emotion, and were also indicated as active when 
using kernel-based methods [see Kober et al. (2008)]. It should be noted we 
obtain the same regions of activation as Kober et al. (2008), but with signif- 
icantly smaller spatial extent. This is consistent with our simulation study, 
which shows how the kernel-based methods tend to overestimate the extent 
of activation. Finally, Figure 9 shows the posterior miscoding probabilities 
(thresholded at 0.10) for the same three slices. High miscoding probabilities 
indicate points that were deemed to be spurious activations and therefore 
given lower weights when calculating the posterior activation probabilities. 
Based on their locations, it appears that our method is providing an effective 
means of downweighting false activations. 

To see if the adaptive smoothing is preferred to the ordinary smoothing in 
this neuroimaging example, we conducted a test on Hq : jj^ = 1 using the de- 
viance information criterion (DIC) introduced by Spiegelhalter et al. (2002). 
More specifically, we first fitted to our imaging data the proposed adaptive 
GMRF model and a (nonadaptive) Bayesian thin-plate spline model (by fix- 
ing all 7jfc to be 1), and saved the MCMC posterior samples of both models. 
Then, we define the deviance as D(cf>) = — 21og(p(y|</>)), where p(y\(f>) is the 
likelihood function and cf> are unknown parameters of the model. The DIC 
score is finally estimated using DIC = 2D — D((f)), where D is calculated 
as the average of D{4>) over the samples of cf), and D(cf)) as the value of D 
evaluated at the average of the samples of cf). The model with smaller DIC 
should be in favor. Table 1 shows the DIC scores of the two models for the 
fixed x, y and z orientations, where the adaptive model is preferred in every 
scenario. 

5. Discussion. We developed a fully Bayesian method for nonparamet- 
ric binary regression and, together with a robustification procedure, applied 
it to meta-analysis in fMRI studies. Our analysis identified activated re- 
gions of the brain that are known to be associated with emotion. While 
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Table 1 

DIC scores of both adaptive and nonadaptive models 



for the 


fixed x, y and 


z orientation. 




Orientation 


X 


y 


z 


Adaptive 


9918.372 


8216.255 


9917.209 


Nonadaptive 


10,090.460 


9512.016 


9947.377 



similar regions were also identified in other meta-analyses such as Kober 
et al. (2008) that use kernel-based methods, our method has several ad- 
vantages over such approaches as follows. The adaptive GMRF used in our 
model better matches the natural spatial resolution of the data across the 
brain compared to using an arbitrary chosen fixed kernel size. This allows 
us to avoid the problem of overestimating regions of activation apparent in 
kernel-based methods. The Bayesian nature of our method allows for the 
construction of posterior probability maps indicating the probability of ob- 
serving a peak activation in response to the paradigm across the brain. This 
is in contrast to kernel methods which simply state that more peaks lie 
near the voxel than expected by chance. It should be noted that recently 
a Bayesian spatial hierarchical model using a marked independent cluster 
process [Kang et al. (2011)] was introduced for dealing with neuroimaging 
meta-analysis. In future work we will look at comparing this method with the 
nonparametric binary regression approach suggested in this paper. Finally, 
our procedure provides estimates of miscoding probabilities which can help 
to identify regions that may have been incorrectly tagged as being activated. 
This is another feature not provided by kernel-based methods. 

It is important to note that in this work the model setup assumes that the 
input data is two dimensional. Such 2D smoothing serves a useful purpose, as 
fMRI data are often analyzed either slice- wise or using cortical surface-based 
techniques [Dale, Fischl and Sereno (1999); Fischl, Sereno and Dale (1999)]. 
In reality, however, fMRI data are three dimensional in space. Therefore, 
it may ultimately be more appropriate to smooth the three spatial dimen- 
sions directly. We are actually working on such an extension of our current 
approach. The main computational constraint stems from inverting a large 
precision matrix, which is of 91 x 109 x 91 = 902,629 dimensions in our neu- 
roimaging example. We thus need a practical 3D GMRF, but the construc- 
tion is nontrivial. One possible solution is to obtain a highly sparse precision 
matrix by discretizing a 3D Laplacian operator with proper boundary con- 
ditions as we did in the 2D case. To achieve more computational efficiency, 
we may use a novel Bayesian inference tool similar to that introduced in 
Rue, Martino and Chopin (2009) rather than MCMC. 

As shown in the simulation studies, the results obtained by our method 
are somewhat sensitive to the prior miscoding probability r in the robus- 
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tification procedure. A large r may underestimate the activation clusters, 
while a small r tends to allow more false positives. The choice of r is often 
subjective. One may use information from, say, previous studies, to find an 
appropriate r in order to balance this trade-off. If no prior information is 
available, Choudhuri, Ghosal and Roy (2007) proposed letting r be a small 
number between 0.01 and 0.1. In practice, we suggest experimenting with 
several r values and choosing the one that gives the best results. 

APPENDIX: MCMC ALGORITHMS FOR POSTERIOR INFERENCE 

A.l. Probit link. Let y = (yi,. .. ,y n ) T be the random vector of binary 
observations measured and x = (x\, . . . , x n ) T the corresponding covariate 
values, where each Xi has one or two component variables. Let w = (w\, . . . , 
w n ) T be some unobservable latent variable. Following Holmes and Held 
(2006), the probit model can be written as 




1, if Wi > 0, 
0, if Wi < 0, 



(6) 

Wi = Zi + £i, ej'^'N^l), 

where z = (z\, . . . , z n ) T is the adaptive GMRF described in Section 2.1. 
Since yi are now deterministic conditional on the sign of the Wi, we have 
P{yi — 0|zj) = P{wi < 0|zj) = $(— Zi), where <!> is the standard Gaussian 
c.d.f. 

As mentioned earlier, the half-t prior on 5 can be written as 5 = \£\0, where 
£~N(0,1) and 6 2 ~ IG(p/2, pS 2 /2). A redundant multiplicative reparame- 
terization can be applied to model (6): 

fl, ifwi>0, 
Vl \ 0, if Wi < 0, 

Wi = €Vi + £h e^'^'N^l), 
where r] = (rfx, ... , rj n ) T has a GMRF prior density 

M# 2 ,7] « |^ 2 A 7 |y 2 exp(-^r7 T A 7 ^, 

with A 7 = B^diag(7)B m for m = 1,2. This expanded model form allows 
conditionally conjugate prior distributions for both £ and 9, and these pa- 
rameters are independent in the conditional posterior distribution [Gelman 
(2006); Gelman et al. (2008)]. Letting d be the dimension of the null space 
of A 7 , the full conditional distributions are listed below: 

• M^C^w) ~ Nn^,^), where n n = and *E V = (£ 2 I n + 
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• (£1*7) w ) ~N(/^,<t|), where /ig =a^rj'w and cr| = (1 + rfrj)~ 1 ; 

• (w\eny)~( N ^ mvH>0)l if2/j = 1 ' 

{w^,rj,y) j iV( ^. )1)/(u; .< 0)) ify . = . 
. ( 7i |e2,^)~lG(^,^ + |), where T7 = B m ry (m = l,2); 

Note that is a banded matrix and we can thus use the banded Cholesky 
decomposition to simulate rj with the cost of 0(n). The quantities it;, have 
independent truncated normal distributions and are also straightforward to 
sample from. 

A. 2. Logit link. Again, we use data augmentation and overparameteri- 
zation to write the logistic regression model as 

1, if Wi > 0, 
0, if Wi < 0, 

(7) Wi = ^Vi + £ h £i~N(0,Ai), 

Aj = (2Kj) 2 , Ki ~ KS, 

where KS denotes the Kolmogorov-Smirnov distribution [e.g., Devroye (1986)]. 
In this case, e, has the form of a scale mixture of normals with a marginal 
logistic distribution. 

To improve mixing of the Markov chains, we update {w, A} jointly given 

[w,A|£,T7,y] = [w|f,f/,y][A|w,£,Tf]. 

Letting A = diag(Ai, . . . , A n ), the posterior conditional distributions are as 
follows: 



V, 



(T7|6l 2 ,^,7,w, A) ~ N n (n v ,i: v ), where n v = ^S,Aw and S z = (£ 2 A + 

l 7 / 



(£\r], w, A) ~ N(^,<7 2 ), where ^ = ajri' Aw and cr| = (1 + 7/A77) x ; 

Logistic^, l)/(tOj >0), ify; = l, 
Logistic^;, l)I(wi < 0), if y. t = 0; 

• [A^K^] oc Xr 1 exp{-^-{ Wi - Zrn) 2 }KS{^y, 
. ( 7j |02^)^ IG (£±i i _i^^;, )) where ^ = Bm7? ( m = i,2); 

. (^| T/)7 )^ IG( n^ + £ ) l T/ / A7T? + £ |! ) . 

The Logistic(a, /3) denotes the density function of the logistic distribution 
with mean a and scale parameter /3 [Devroye (1986), page 39]. Sampling 
from the truncated logistic distribution can be done efficiently by the inver- 
sion method. Although it is not a standard task, sampling Aj is simple using 
a rejection method as outlined in Holmes and Held (2006). 
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A. 3. Other scale mixtures of normal links. The auxiliary variable sam- 
pling scheme described above can easily be generalized to work for any link 
function H that can be represented as scale mixtures of normal cdfs, and, 
hence, 



where v follows some continuous or discrete distribution G on (0, oo). A wide 
class of continuous, unimodal and symmetric distributions on the real line 
may be constructed as scale mixtures of normals. Many examples, such as 
discrete mixtures or contaminated normals, the Student t family, logistic, 
Laplace or double-exponential, and the stable family, are well known; see, 
for example, Andrews and Mallows (1974). 

Similarly, we introduce two sets of latent variables w = (wi, . . . , w n ) T and 

v = . . . , v n ) T such that (wi\z, v) ~ N(zj, v j), vi 1 ~ ' G, and yi = I(wi > 0). 
Then, conditional on z, the y^'s are independent Bernoulli random variables 
with success probability H(zi). Suppose G has a Lebesgue density or prob- 
ability mass function g. Let Zi = £r\i and V = diag(t>i, . . . , v n ). Then, the 
posterior conditional distributions are as follows: 

• (*7|0 2 ,£,7,w,v) ~ N n (/i^,E,), where n n = £V£,,w and T, v = (£ 2 V + 

• (£1^7) w ; v ) ~ N(^,cr|), where = a^r/Vw and cr| = (1 + r/Yrj) 1 ; 
{Wt\S,V,v,y) \N(£r^)^<0), if ^ = 0; 

• [vi\£,,wi,vi] cx ^ rl/2ex p{-2^(^ - im) 2 }g(vi); 

• ( 7i |0 2 ,77)~ 10(^,2^! + |), where f7 = B m r/ (m = l,2); 
. (^,7)^10(2^ + 1,1^^ + ^). 

Thus, a Gibbs sampler can be used to sample joint posterior distributions. 
The only difficult part is sampling 6i. For a Student t link, the mixing dis- 
tribution G is an inverse gamma distribution, as is the full conditional of 
each Vi. For the Laplace link, the G is an exponential distribution and the v~ l 
follows an inverse Gaussian conditional distribution. Therefore, one can di- 
rectly sample Vj's for those two links. If [vi\(;,Wi,r]i] does not correspond 
to any regular density, the samples may be drawn via acceptance-rejection 
sampling. 
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